As living standards have improved, many people now consider traveling their first choice for spending spare time. Traveling with family and friends can be an unforgettable experience, but we often hear bad news about property loss, injury, or even death during trips. Although being careful is important for avoiding danger while traveling, choosing a proper destination, safe transportation, and a reliable agency can also help protect us.
So what factors might cause accidents during travel?
We apply Bayesian analysis to a travel insurance dataset from Kaggle, provided by a third-party insurance servicing company based in Singapore.
## Agency Agency.Type Distribution.Channel
## EPX :35119 Airlines :17457 Offline: 1107
## CWT : 8580 Travel Agency:45869 Online :62219
## C2B : 8267
## JZI : 6329
## SSI : 1056
## JWT : 749
## (Other): 3226
## Product.Name Claim Duration
## Cancellation Plan :18630 No :62399 Min. : -2.00
## 2 way Comprehensive Plan :13158 Yes: 927 1st Qu.: 9.00
## Rental Vehicle Excess Insurance: 8580 Median : 22.00
## Basic Plan : 5469 Mean : 49.32
## Bronze Plan : 4049 3rd Qu.: 53.00
## 1 way Comprehensive Plan : 3331 Max. :4881.00
## (Other) :10109
## Destination Net.Sales Commision..in.value. Gender
## SINGAPORE:13255 Min. :-389.00 Min. : 0.00 F : 8872
## MALAYSIA : 5930 1st Qu.: 18.00 1st Qu.: 0.00 M : 9347
## THAILAND : 5894 Median : 26.53 Median : 0.00 NA's:45107
## CHINA : 4796 Mean : 40.70 Mean : 9.81
## AUSTRALIA: 3694 3rd Qu.: 48.00 3rd Qu.: 11.55
## INDONESIA: 3452 Max. : 810.00 Max. :283.50
## (Other) :26305
## Age
## Min. : 0.00
## 1st Qu.: 35.00
## Median : 36.00
## Mean : 39.97
## 3rd Qu.: 43.00
## Max. :118.00
##
In total: 63,326 records, 11 variables.
Target: Claim status (Yes/No); the claim status indicates whether the customer encountered an accident during travel.
Features: Agency, Agency type, Distribution channel, Product name, Duration, Destination, Net sales, Commission, Gender, Age.
Gender has too many missing values (45,107 of 63,326 records), so we drop the column directly.
## [1] 45107
## Agency Agency.Type Distribution.Channel
## EPX :34382 Airlines :17072 Offline: 1077
## C2B : 8080 Travel Agency:43692 Online :59687
## CWT : 7220
## JZI : 6175
## SSI : 1046
## JWT : 740
## (Other): 3121
## Product.Name Claim Duration
## Cancellation Plan :18212 No :59840 Min. : -2.00
## 2 way Comprehensive Plan :12907 Yes: 924 1st Qu.: 9.00
## Rental Vehicle Excess Insurance: 7220 Median : 22.00
## Basic Plan : 5352 Mean : 48.94
## Bronze Plan : 3967 3rd Qu.: 52.00
## 1 way Comprehensive Plan : 3263 Max. :4881.00
## (Other) : 9843
## Destination Net.Sales Commision Age
## SINGAPORE:12958 Min. : 0.07 Min. : 0.000 Min. : 0.00
## THAILAND : 5735 1st Qu.: 19.80 1st Qu.: 0.000 1st Qu.: 35.00
## MALAYSIA : 5643 Median : 28.00 Median : 0.000 Median : 36.00
## CHINA : 4675 Mean : 43.10 Mean : 9.346 Mean : 39.98
## INDONESIA: 3393 3rd Qu.: 49.50 3rd Qu.: 10.500 3rd Qu.: 43.00
## AUSTRALIA: 3316 Max. :810.00 Max. :283.500 Max. :118.00
## (Other) :25044
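A minimal sketch of this cleaning step. The data frame name `travel` and the toy rows are placeholders; only the column names follow the summary above:

```r
# Toy stand-in for the Kaggle data frame (illustrative only)
travel <- data.frame(
  Gender = factor(c("F", NA, "M", NA)),
  Claim  = factor(c("No", "No", "Yes", "No"))
)
sum(is.na(travel$Gender))   # count of missing Gender values
travel$Gender <- NULL       # drop the column entirely
```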
For the target "Claim", we do label encoding: "Yes" to 1, "No" to 0.
##
## 0 1
## 59840 924
Class proportions for the target "Claim":
## [1] "Yes:"
## [1] 0.01520637
## [1] "No:"
## [1] 0.9847936
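The encoding and class-proportion check can be sketched as follows (toy vector in place of the real column):

```r
# Label-encode the target: "Yes" -> 1, "No" -> 0 (toy data for illustration)
claim <- factor(c("No", "No", "Yes", "No", "No"))
claim01 <- ifelse(claim == "Yes", 1L, 0L)
table(claim01)               # counts per class
prop.table(table(claim01))   # class proportions
```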
Two levels: 59,840 (98.5%) "No" and 924 (1.5%) "Yes". The classes are highly imbalanced, so we tried upsampling, but performance became worse.
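The upsampling we tried amounts to random oversampling of the minority class; a sketch on a toy label vector:

```r
# Simple random oversampling of the minority class (sketch)
set.seed(3)
y <- c(rep(0, 95), rep(1, 5))                 # imbalanced toy labels
minority <- which(y == 1)
# Resample minority indices (with replacement) until classes match
extra <- sample(minority, sum(y == 0) - length(minority), replace = TRUE)
y_bal <- c(y, y[extra])
table(y_bal)                                   # classes now balanced
```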
For the categorical features, the numbers of levels are:
## [1] 16
## [1] 2
## [1] 147
## [1] 2
## [1] 25
“Agency”: 16 levels, “Destination”: 147 levels, “Product name”: 25 levels.
Too many levels in “Agency”, “Destination”, and “Product name” would generate too many features after one-hot encoding, so:
“Destination”: 147 levels → top 10 levels + 1 “Others” level
“Agency”: 16 levels → keep
“Product name”: correlated with “Agency” → drop column
## [1] 10
## [1] 147
## [1] 137
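Collapsing a high-cardinality factor to its top 10 levels plus "Others" can be sketched like this (toy destination labels `D1`–`D20` stand in for the real countries):

```r
# Keep the 10 most frequent destinations; lump the rest into "Others"
set.seed(1)
dest <- factor(sample(paste0("D", 1:20), 500, replace = TRUE))
top10 <- names(sort(table(dest), decreasing = TRUE))[1:10]
dest_lumped <- factor(ifelse(dest %in% top10, as.character(dest), "Others"))
nlevels(dest_lumped)   # 11 levels: top 10 + "Others"
```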
As there are many categorical variables in our dataset and computing power is limited, we use the lasso for feature selection and keep the top 15 most important features.
## [1] "AgencyC2B" "Distribution.ChannelOnline"
## [3] "AgencyLWC" "AgencyTST"
## [5] "AgencyKML" "AgencyRAB"
## [7] "AgencySSI" "AgencyCBH"
## [9] "AgencyCSR" "AgencyCWT"
## [11] "DestinationSINGAPORE" "AgencyJWT"
## [13] "AgencyCCR" "DestinationMALAYSIA"
## [15] "DestinationPHILIPPINES"
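The lasso step can be sketched with `glmnet` (the matrix `X` and outcome `y` below are simulated stand-ins for the one-hot-encoded features and the encoded claim labels; `cv.glmnet` with `alpha = 1` is the usual lasso call):

```r
# Lasso-based feature ranking (sketch; requires the glmnet package)
library(glmnet)
set.seed(42)
X <- matrix(rbinom(200 * 30, 1, 0.3), nrow = 200)      # stand-in one-hot matrix
colnames(X) <- paste0("f", 1:30)
y <- rbinom(200, 1, plogis(-2 + 2 * X[, 1] - 1.5 * X[, 2]))
fit <- cv.glmnet(X, y, family = "binomial", alpha = 1)  # alpha = 1 -> lasso
coefs <- coef(fit, s = "lambda.min")[-1, 1]             # drop the intercept
top15 <- names(sort(abs(coefs), decreasing = TRUE))[1:15]
```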
We use JAGS to run a robust logistic regression with target “Claim”. First, we one-hot encode the categorical features and keep the 15 columns selected by the lasso. Then we split the data into training (90%) and testing (10%) sets. We also check the correlations between features: only AgencyC2B and DestinationSINGAPORE show a somewhat higher correlation. In practice, all of AgencyC2B’s clients travel to Singapore, but not all clients traveling to Singapore come from AgencyC2B, so we keep both features and examine the result.
## 'data.frame': 60764 obs. of 16 variables:
## $ AgencyC2B : num 1 1 1 1 1 0 0 0 0 0 ...
## $ Distribution.ChannelOnline: num 1 1 1 1 1 1 1 1 1 1 ...
## $ AgencyLWC : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AgencyTST : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AgencyKML : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AgencyRAB : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AgencySSI : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AgencyCBH : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AgencyCSR : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AgencyCWT : num 0 0 0 0 0 1 1 1 1 1 ...
## $ DestinationSINGAPORE : num 1 1 1 1 1 0 0 0 0 0 ...
## $ AgencyJWT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AgencyCCR : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DestinationMALAYSIA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DestinationPHILIPPINES : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Claim : num 0 0 1 0 0 0 0 0 0 0 ...
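The split and correlation check can be sketched as follows (toy columns mimic two of the binary features above):

```r
# 90/10 train-test split and a quick pairwise correlation check (toy data)
set.seed(7)
df <- data.frame(
  AgencyC2B            = rbinom(1000, 1, 0.15),
  DestinationSINGAPORE = rbinom(1000, 1, 0.20),
  Claim                = rbinom(1000, 1, 0.05)
)
idx   <- sample(nrow(df), size = floor(0.9 * nrow(df)))
train <- df[idx, ]
test  <- df[-idx, ]
cor(train$AgencyC2B, train$DestinationSINGAPORE)   # pairwise correlation
```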
Finally, we fit the robust logistic regression with a burn-in of 2,000 iterations, 4 chains, and 3,750 iterations per chain. The posterior distribution of the guessing parameter is concentrated near very small values, which means the model behaves much like an ordinary logistic regression. The parameter is nevertheless significant: 0 is not in the HDI, so we should not ignore it.
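A sketch of the model specification, following the usual robust (Kruschke-style) logistic regression in which the guessing parameter `alpha` mixes a coin-flip probability of 1/2 with the logistic curve; the exact priors here are illustrative assumptions, not the project's exact code:

```r
# Robust logistic regression as a JAGS model string; in practice this is
# passed to rjags/runjags together with y, x, N, and K.
model_string <- "
model {
  for (i in 1:N) {
    y[i] ~ dbern(mu[i])
    # alpha = guessing weight: mixes 1/2 with the logistic probability
    mu[i] <- alpha * 0.5 + (1 - alpha) * ilogit(beta0 + inprod(beta[], x[i, ]))
  }
  beta0 ~ dnorm(0, 1 / 4)
  for (j in 1:K) { beta[j] ~ dnorm(0, 1 / 4) }
  alpha ~ dbeta(1, 9)   # prior mass near 0: little guessing expected
}
"
```

When `alpha` is near 0 this reduces to ordinary logistic regression, which is exactly the behavior observed in the posterior.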
From the posterior distributions, we can see that the betas of 7 features are not significant; in other words, their HDIs contain 0, so these betas may equal zero and the corresponding features may not matter for explaining the target “Claim”. For agencies with a positive mode, such as AgencyCWT, traveling with that agency is associated with a higher chance of claiming travel insurance. For destinations with a negative mode, travel to those countries appears safer: insurance claims seldom happen.
From each beta’s diagnostics, we can see that beta0 and beta5 converge very well and have good effective sample sizes (ESS), which means these two betas are very stable.
Looking at beta1 and beta11, these two betas converge badly: the trace plots are sticky, the autocorrelation is very high, and the ESS is very low. This is likely caused by the high correlation between the two corresponding features.
Finally, we want to check the performance of the MCMC result. We use the 8 significant variables from the MCMC to train a model and compare the results with and without the guessing parameter. To improve accuracy, we also oversample to balance the target.
##
## Call:
## glm(formula = Claim ~ ., family = "binomial", data = train_pred)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1754 -0.8036 -0.5535 0.6158 1.9758
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.04448 0.05815 17.96 <2e-16 ***
## AgencyC2B 3.36540 0.04182 80.47 <2e-16 ***
## Distribution.ChannelOnline -2.00928 0.05840 -34.41 <2e-16 ***
## AgencyLWC 2.46889 0.04934 50.04 <2e-16 ***
## AgencyTST -1.70233 0.11253 -15.13 <2e-16 ***
## AgencyKML 1.22307 0.07241 16.89 <2e-16 ***
## AgencyCWT 0.67507 0.02279 29.61 <2e-16 ***
## DestinationSINGAPORE -0.83394 0.04080 -20.44 <2e-16 ***
## DestinationMALAYSIA -0.81540 0.03481 -23.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 145134 on 104691 degrees of freedom
## Residual deviance: 111601 on 104683 degrees of freedom
## AIC: 111619
##
## Number of Fisher Scoring iterations: 4
## predict
## real 0 1
## 0 6365 1128
## 1 48 55
## [1] 0.8451817
## predict
## real 0 1
## 0 6365 1128
## 1 48 55
## [1] 0.8451817
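The accuracy figure follows directly from the confusion matrix above:

```r
# Accuracy from the confusion matrix reported above
cm <- matrix(c(6365, 48, 1128, 55), nrow = 2,
             dimnames = list(real = c(0, 1), predict = c(0, 1)))
acc <- sum(diag(cm)) / sum(cm)   # (6365 + 55) / 7596
round(acc, 7)                    # 0.8451817
```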
From the results we can see that the two models are almost the same. When alpha equals 0, the model is non-robust, i.e., ordinary logistic regression; when alpha equals 1, the model is a horizontal line with y-intercept 1/2. Our model is therefore very close to a non-robust model: the guessing parameter is so small that it has almost no influence.
From the correlation plot, we found that the correlation between AgencyC2B and DestinationSINGAPORE is the highest, and the earlier MCMC showed that the betas of these two features are unstable, as seen in the four diagnostic plots. We therefore drop one of them to see how multicollinearity influences sampling stability, keeping DestinationSINGAPORE for the next round. After rerunning the MCMC, even though the chains still do not converge very well, the diagnostics for DestinationSINGAPORE improve substantially: the ESS increases and the MCSE decreases.
In conclusion, non-robust logistic regression is good enough for this project. In the MCMC, only 8 of the 15 variables are significant. When highly correlated variables are present in a dataset, they impair the convergence of the MCMC samples.